For this lab, we are considering the diamond dataset to develop a model for predicting the price of a diamond.
The dataset is a collection of approximately 54,000 observations regarding different aspects and traits of a diamond.
This dataset was sourced from Kaggle, though the data originated from Tiffany & Co.'s 2017 snapshot pricelist.
With this dataset, we hope to create a model that can accurately predict the price of a diamond from its different aspects and traits. This model is meant to automate and speed up the process of determining a diamond's price by its features. The dataset includes four of the most important qualities for determining a diamond's value: color, clarity, cut, and carat weight. These factors, along with the others, help to ascertain the rarity of a diamond, and therefore its value and price. Our model will take in the diamond data and predict the diamond's price. The different aspects of a diamond can significantly change its price range, making the model useful for companies involved in the diamond industry, whether they are making diamonds or selling them. Our model would be most useful in offline analysis, once all the information on the diamond has been determined and collected.
We hope that our prediction model will achieve an accuracy of 95%.
# import necessary libraries
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import plotly
import copy
import cv2
import os
from sklearn import metrics as mt
# Add name of dataset to absolute path
data_directory = os.getcwd()
# import csv file (use os.path.join so the path works on any operating system)
df = pd.read_csv(os.path.join(data_directory, 'diamonds.csv'))
df.head()
| | carat | cut | color | clarity | depth | table | price | 'x' | 'y' | 'z' |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | b'Ideal' | b'E' | b'SI2' | 61.5 | 55.0 | 326.0 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | b'Premium' | b'E' | b'SI1' | 59.8 | 61.0 | 326.0 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | b'Good' | b'E' | b'VS1' | 56.9 | 65.0 | 327.0 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | b'Premium' | b'I' | b'VS2' | 62.4 | 58.0 | 334.0 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | b'Good' | b'J' | b'SI2' | 63.3 | 58.0 | 335.0 | 4.34 | 4.35 | 2.75 |
It should be acknowledged that there are limitations with this dataset. As we progress through cleaning the dataset, we will be able to better explain the specific limitations of the dataset.
Let's see the size of the dataset and the different datatypes of its features.
print(f"Number of observations in training data: {df.shape[0]}")
print(f"Number of features in training data: {df.shape[1]}\n\n")
print(df.dtypes)
Number of observations in training data: 53940
Number of features in training data: 10

carat      float64
cut         object
color       object
clarity     object
depth      float64
table      float64
price      float64
'x'        float64
'y'        float64
'z'        float64
dtype: object
There are 10 features in this dataset. The features are:
- carat: the weight of the diamond
- cut: how well the diamond's facets have been cut
- color: the color of the diamond
- clarity: the measurement of the diamond's purity and rarity
- depth (should be depth percentage): the depth of the diamond relative to its average diameter
- table: the width of the diamond's flat top facet
- price: the price of the diamond after considering its attributes (target label)
- x: length of the diamond (mm)
- y: width of the diamond (mm)
- z: depth of the diamond (mm)

Although these characteristics are important when determining the price of a diamond, a limitation of this dataset is that it only considers these characteristics. One factor to consider would be the diamond's country of origin, which can influence the quality and price of a diamond.
Another limitation is that there are only roughly 54,000 diamond observations. Though this may seem like plenty, our model would benefit from additional data for training.
An additional limitation is that this dataset was taken from a U.S company (Tiffany & Co.), which may not be representative of the overall diamond market and supply.
Next, let's rename the x, y, z, and depth columns so that each column name is more representative of the values it holds, which makes the dataset more intuitive for readers.
The depth column (not z, which is the physical depth) is misnamed: it actually holds the depth percentage of a diamond, which indicates the "depth of a diamond in relation to its width" (With Clarity). Depth percentage is calculated by dividing total depth by the average diameter, then multiplying by 100.
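As a quick sanity check of this formula, we can plug in the dimensions from the first row of the table above (length 3.95 mm, width 3.98 mm, depth 2.43 mm); the result lands close to the recorded depth of 61.5, with the small gap attributable to rounding in the recorded measurements:

```python
# verify the depth-percentage formula on the first diamond in the dataset
# (values copied from the table above)
length, width, depth = 3.95, 3.98, 2.43  # millimetres

# total depth divided by the average diameter, times 100
depth_percentage = depth / ((length + width) / 2) * 100
print(round(depth_percentage, 1))  # 61.3, vs. the recorded 61.5
```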
df = df.rename(columns={'depth': 'depth_percentage'})
Since the x, y, and z features represent the length, width, and depth of a diamond, respectively, we will change the columns names accordingly.
# changing column names
df = df.rename(columns={"'x'": 'length'})
df = df.rename(columns={"'y'": 'width'})
df = df.rename(columns={"'z'": 'depth'})
print(df.dtypes)
carat               float64
cut                  object
color                object
clarity              object
depth_percentage    float64
table               float64
price               float64
length              float64
width               float64
depth               float64
dtype: object
Next, we will perform actual dataset cleaning. Here, we can consider additional limitations such as data quality issues and inconsistencies.
We will check if there are any null values in our dataset. Having too many null values can impact the performance of our model.
# check for missing value
df.isnull().sum()
carat               0
cut                 0
color               0
clarity             0
depth_percentage    0
table               0
price               0
length              0
width               0
depth               0
dtype: int64
There are no null values! We won't have to worry about missing data with this dataset. However, we should check for any unusual values such as 0's.
# check for 0s in each column
df.isin([0]).sum()
carat                0
cut                  0
color                0
clarity              0
depth_percentage     0
table                0
price                0
length               8
width                7
depth               20
dtype: int64
We observe that there are 8 zeros in the length column, 7 zeros in the width column, and 20 zeros in the depth column, while the rest of the features contain no zeros. Let's print out the rows that have zeros in the length, width, and depth columns.
# create new variable: price_range
df['price_range'] = pd.cut(df['price'],
[300,900,2000,4000,6500,1e6],
labels=['0','1','2','3','4'])
df.price_range.describe()
# convert price_range to numeric
df["price_range"]=df["price_range"].astype(str).astype(int)
# drop the price column
df.drop(['price'], axis=1, inplace=True)
# now lets group with the new variable
df_grouped = df.groupby(by=['cut','price_range'])
print("Number of diamonds of each cut in each price range:")
df_grouped.count()
Number of diamonds of each cut in each price range:
| cut | price_range | carat | color | clarity | depth_percentage | table | length | width | depth |
|---|---|---|---|---|---|---|---|---|---|
| b'Fair' | 0 | 76 | 76 | 76 | 76 | 76 | 76 | 76 | 76 |
| | 1 | 312 | 312 | 312 | 312 | 312 | 312 | 312 | 312 |
| | 2 | 591 | 591 | 591 | 591 | 591 | 591 | 591 | 591 |
| | 3 | 344 | 344 | 344 | 344 | 344 | 344 | 344 | 344 |
| | 4 | 287 | 287 | 287 | 287 | 287 | 287 | 287 | 287 |
| b'Good' | 0 | 1011 | 1011 | 1011 | 1011 | 1011 | 1011 | 1011 | 1011 |
| | 1 | 825 | 825 | 825 | 825 | 825 | 825 | 825 | 825 |
| | 2 | 1196 | 1196 | 1196 | 1196 | 1196 | 1196 | 1196 | 1196 |
| | 3 | 1063 | 1063 | 1063 | 1063 | 1063 | 1063 | 1063 | 1063 |
| | 4 | 811 | 811 | 811 | 811 | 811 | 811 | 811 | 811 |
| b'Ideal' | 0 | 5690 | 5690 | 5690 | 5690 | 5690 | 5690 | 5690 | 5690 |
| | 1 | 5912 | 5912 | 5912 | 5912 | 5912 | 5912 | 5912 | 5912 |
| | 2 | 3755 | 3755 | 3755 | 3755 | 3755 | 3755 | 3755 | 3755 |
| | 3 | 2577 | 2577 | 2577 | 2577 | 2577 | 2577 | 2577 | 2577 |
| | 4 | 3617 | 3617 | 3617 | 3617 | 3617 | 3617 | 3617 | 3617 |
| b'Premium' | 0 | 2631 | 2631 | 2631 | 2631 | 2631 | 2631 | 2631 | 2631 |
| | 1 | 2767 | 2767 | 2767 | 2767 | 2767 | 2767 | 2767 | 2767 |
| | 2 | 2281 | 2281 | 2281 | 2281 | 2281 | 2281 | 2281 | 2281 |
| | 3 | 2809 | 2809 | 2809 | 2809 | 2809 | 2809 | 2809 | 2809 |
| | 4 | 3303 | 3303 | 3303 | 3303 | 3303 | 3303 | 3303 | 3303 |
| b'Very Good' | 0 | 2981 | 2981 | 2981 | 2981 | 2981 | 2981 | 2981 | 2981 |
| | 1 | 2002 | 2002 | 2002 | 2002 | 2002 | 2002 | 2002 | 2002 |
| | 2 | 2531 | 2531 | 2531 | 2531 | 2531 | 2531 | 2531 | 2531 |
| | 3 | 2258 | 2258 | 2258 | 2258 | 2258 | 2258 | 2258 | 2258 |
| | 4 | 2310 | 2310 | 2310 | 2310 | 2310 | 2310 | 2310 | 2310 |
# print row index of rows with zeros in length column
df.loc[df['length'] == 0]
| | carat | cut | color | clarity | depth_percentage | table | length | width | depth | price_range |
|---|---|---|---|---|---|---|---|---|---|---|
| 11182 | 1.07 | b'Ideal' | b'F' | b'SI2' | 61.6 | 56.0 | 0.0 | 6.62 | 0.0 | 3 |
| 11963 | 1.00 | b'Very Good' | b'H' | b'VS2' | 63.3 | 53.0 | 0.0 | 0.00 | 0.0 | 3 |
| 15951 | 1.14 | b'Fair' | b'G' | b'VS1' | 57.5 | 67.0 | 0.0 | 0.00 | 0.0 | 3 |
| 24520 | 1.56 | b'Ideal' | b'G' | b'VS2' | 62.2 | 54.0 | 0.0 | 0.00 | 0.0 | 4 |
| 26243 | 1.20 | b'Premium' | b'D' | b'VVS1' | 62.1 | 59.0 | 0.0 | 0.00 | 0.0 | 4 |
| 27429 | 2.25 | b'Premium' | b'H' | b'SI2' | 62.8 | 59.0 | 0.0 | 0.00 | 0.0 | 4 |
| 49556 | 0.71 | b'Good' | b'F' | b'SI2' | 64.1 | 60.0 | 0.0 | 0.00 | 0.0 | 2 |
| 49557 | 0.71 | b'Good' | b'F' | b'SI2' | 64.1 | 60.0 | 0.0 | 0.00 | 0.0 | 2 |
# print row index of rows with zeros in width column
df.loc[df['width'] == 0]
| | carat | cut | color | clarity | depth_percentage | table | length | width | depth | price_range |
|---|---|---|---|---|---|---|---|---|---|---|
| 11963 | 1.00 | b'Very Good' | b'H' | b'VS2' | 63.3 | 53.0 | 0.0 | 0.0 | 0.0 | 3 |
| 15951 | 1.14 | b'Fair' | b'G' | b'VS1' | 57.5 | 67.0 | 0.0 | 0.0 | 0.0 | 3 |
| 24520 | 1.56 | b'Ideal' | b'G' | b'VS2' | 62.2 | 54.0 | 0.0 | 0.0 | 0.0 | 4 |
| 26243 | 1.20 | b'Premium' | b'D' | b'VVS1' | 62.1 | 59.0 | 0.0 | 0.0 | 0.0 | 4 |
| 27429 | 2.25 | b'Premium' | b'H' | b'SI2' | 62.8 | 59.0 | 0.0 | 0.0 | 0.0 | 4 |
| 49556 | 0.71 | b'Good' | b'F' | b'SI2' | 64.1 | 60.0 | 0.0 | 0.0 | 0.0 | 2 |
| 49557 | 0.71 | b'Good' | b'F' | b'SI2' | 64.1 | 60.0 | 0.0 | 0.0 | 0.0 | 2 |
# print row index of rows with zeros in depth column
df.loc[df['depth'] == 0]
| | carat | cut | color | clarity | depth_percentage | table | length | width | depth | price_range |
|---|---|---|---|---|---|---|---|---|---|---|
| 2207 | 1.00 | b'Premium' | b'G' | b'SI2' | 59.1 | 59.0 | 6.55 | 6.48 | 0.0 | 2 |
| 2314 | 1.01 | b'Premium' | b'H' | b'I1' | 58.1 | 59.0 | 6.66 | 6.60 | 0.0 | 2 |
| 4791 | 1.10 | b'Premium' | b'G' | b'SI2' | 63.0 | 59.0 | 6.50 | 6.47 | 0.0 | 2 |
| 5471 | 1.01 | b'Premium' | b'F' | b'SI2' | 59.2 | 58.0 | 6.50 | 6.47 | 0.0 | 2 |
| 10167 | 1.50 | b'Good' | b'G' | b'I1' | 64.0 | 61.0 | 7.15 | 7.04 | 0.0 | 3 |
| 11182 | 1.07 | b'Ideal' | b'F' | b'SI2' | 61.6 | 56.0 | 0.00 | 6.62 | 0.0 | 3 |
| 11963 | 1.00 | b'Very Good' | b'H' | b'VS2' | 63.3 | 53.0 | 0.00 | 0.00 | 0.0 | 3 |
| 13601 | 1.15 | b'Ideal' | b'G' | b'VS2' | 59.2 | 56.0 | 6.88 | 6.83 | 0.0 | 3 |
| 15951 | 1.14 | b'Fair' | b'G' | b'VS1' | 57.5 | 67.0 | 0.00 | 0.00 | 0.0 | 3 |
| 24394 | 2.18 | b'Premium' | b'H' | b'SI2' | 59.4 | 61.0 | 8.49 | 8.45 | 0.0 | 4 |
| 24520 | 1.56 | b'Ideal' | b'G' | b'VS2' | 62.2 | 54.0 | 0.00 | 0.00 | 0.0 | 4 |
| 26123 | 2.25 | b'Premium' | b'I' | b'SI1' | 61.3 | 58.0 | 8.52 | 8.42 | 0.0 | 4 |
| 26243 | 1.20 | b'Premium' | b'D' | b'VVS1' | 62.1 | 59.0 | 0.00 | 0.00 | 0.0 | 4 |
| 27112 | 2.20 | b'Premium' | b'H' | b'SI1' | 61.2 | 59.0 | 8.42 | 8.37 | 0.0 | 4 |
| 27429 | 2.25 | b'Premium' | b'H' | b'SI2' | 62.8 | 59.0 | 0.00 | 0.00 | 0.0 | 4 |
| 27503 | 2.02 | b'Premium' | b'H' | b'VS2' | 62.7 | 53.0 | 8.02 | 7.95 | 0.0 | 4 |
| 27739 | 2.80 | b'Good' | b'G' | b'SI2' | 63.8 | 58.0 | 8.90 | 8.85 | 0.0 | 4 |
| 49556 | 0.71 | b'Good' | b'F' | b'SI2' | 64.1 | 60.0 | 0.00 | 0.00 | 0.0 | 2 |
| 49557 | 0.71 | b'Good' | b'F' | b'SI2' | 64.1 | 60.0 | 0.00 | 0.00 | 0.0 | 2 |
| 51506 | 1.12 | b'Premium' | b'G' | b'I1' | 60.4 | 59.0 | 6.71 | 6.67 | 0.0 | 2 |
These inconsistencies are data quality issues: it is physically impossible for a diamond to have a length, width, or depth of 0 mm, whatever its cut grade.
# total number of zero entries in the length, width, and depth columns
# (a row with zeros in several columns is counted once per column)
print(f"Total number of zero entries in length, width, and depth columns: {df[df['length'] == 0].shape[0] + df[df['width'] == 0].shape[0] + df[df['depth'] == 0].shape[0]}")
# the corresponding percentage of the dataset
print(f"Percentage of zero entries in length, width, and depth columns: {(df[df['length'] == 0].shape[0] + df[df['width'] == 0].shape[0] + df[df['depth'] == 0].shape[0]) / df.shape[0] * 100}%")
Total number of zero entries in length, width, and depth columns: 35
Percentage of zero entries in length, width, and depth columns: 0.06488691138301816%
# replace the zero dimensions with each column's median rather than dropping the rows
df.loc[df['length'] == 0, 'length'] = df['length'].median()
df.loc[df['depth'] == 0, 'depth'] = df['depth'].median()
df.loc[df['width'] == 0, 'width'] = df['width'].median()
Now, let's check for duplicate rows in our dataset. On Kaggle, the author did mention that there are duplicate observations, and that the x, y, z, table, and depth columns should be disregarded when finding the duplicates, since those values depend on the angle at which they were measured.
# drop duplicates (keep=False removes every copy of a duplicated row, not just the extras)
df.drop_duplicates(keep = False, inplace=True)
print(f"Number of observations after deleting duplicates: {df.shape[0]}")
Number of observations after deleting duplicates: 53299
Deleting the duplicates will reduce the computational power and time needed to go through this dataset.
Now that we have cleaned the dataset, let us observe the dataset statistics.
# dataset statistics
df.describe()
| | carat | depth_percentage | table | length | width | depth | price_range |
|---|---|---|---|---|---|---|---|
| count | 53299.000000 | 53299.000000 | 53299.000000 | 53299.000000 | 53299.000000 | 53299.000000 | 53299.000000 |
| mean | 0.796621 | 61.747412 | 57.455065 | 5.729599 | 5.733324 | 3.538592 | 1.871029 |
| std | 0.472033 | 1.425659 | 2.230042 | 1.116808 | 1.138125 | 0.700867 | 1.433017 |
| min | 0.200000 | 43.000000 | 43.000000 | 3.730000 | 3.680000 | 1.070000 | 0.000000 |
| 25% | 0.400000 | 61.000000 | 56.000000 | 4.710000 | 4.720000 | 2.910000 | 1.000000 |
| 50% | 0.700000 | 61.800000 | 57.000000 | 5.700000 | 5.710000 | 3.520000 | 2.000000 |
| 75% | 1.040000 | 62.500000 | 59.000000 | 6.540000 | 6.530000 | 4.030000 | 3.000000 |
| max | 5.010000 | 78.200000 | 95.000000 | 10.740000 | 58.900000 | 31.800000 | 4.000000 |
Focusing on the standard deviation first, the highest lies in the table column, meaning that its values spread quite far from the column mean. The price_range and depth_percentage columns follow, while the length and width columns cluster around a standard deviation of 1. The depth and carat columns have the lowest standard deviations, meaning their values sit close to the column mean. For the carat feature, this highlights another limitation of the dataset, since the diamonds do not vary much in their weight. Observing the max value of the carat column, the heaviest diamond is only 5 carats, which is not representative of the overall diamond supply.
To help us with additional observations, let us visualize the feature values.
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
sns.set()
sns.pairplot(df, hue = "carat", height = 2)
The diagonal graphs represent histograms of a singular feature, while the non-diagonal graphs represent a scatterplot between the intersection of two features. Focusing on the histograms, we can see how all of them except for the length histogram are very narrow, meaning their values are very close to the mean. This can become a limitation since, when the model is being tested, it is more likely for the prediction to be wrong if, for example, the depth value of the test diamond falls outside the limited range of the depth feature.
There are countless observations to be made with these graphs, but we will move on with cleaning the dataset.
import seaborn as sns
sns.set(style="darkgrid")
f, ax = plt.subplots(figsize=(8, 8))
# numeric_only=True restricts the correlation to numeric columns
sns.heatmap(df.corr(numeric_only=True), annot=True)
From the heatmap, we can observe multiple strong positive correlations, most notably between carat and the physical dimensions (length, width, and depth).
The strongest negative correlation exists between table and depth_percentage.
We have deleted all duplicate observations and others that had 0 values in their length, width and depth features. However, because our dataset contains ordinal features, we must go into detail about their relationship to the value of a diamond and the different ranks within each feature. We will One-Hot Encode the color, clarity, and cut columns.
According to the article "Diamond Color For Your Engagement Ring: An In-Depth Guide", the lack of color of a diamond is also considered when determining the quality of a diamond.
# print out unique color values
print(df['color'].unique())
["b'E'" "b'I'" "b'J'" "b'H'" "b'F'" "b'G'" "b'D'"]
The color categories, ranked from best (most colorless) to worst, are: D, E, F, G, H, I, J.
# one hot encode the color feature
if "color" in df.columns:
df = pd.get_dummies(df, columns=["color"])
# rename the color columns
df.rename(columns = {
"color_b\'D\'": "color_D",
"color_b\'E\'": "color_E",
"color_b\'F\'": "color_F",
"color_b\'G\'": "color_G",
"color_b\'H\'": "color_H",
"color_b\'I\'": "color_I",
"color_b\'J\'": "color_J"
}, inplace = True)
Clarity is the "measurement of purity and rarity" of the gemstone. The fewer internal flaws, also known as inclusions, a diamond has, the higher the grade it receives in clarity (Tiffany & Co).
# print out unique clarity values
print(df['clarity'].unique())
["b'SI2'" "b'SI1'" "b'VS1'" "b'VS2'" "b'VVS2'" "b'VVS1'" "b'I1'" "b'IF'"]
The clarity categories, ranked from best to worst, are: IF (internally flawless), VVS1, VVS2, VS1, VS2, SI1, SI2, and I1.
# one hot encode the clarity feature
if "clarity" in df.columns:
df = pd.get_dummies(df, columns=["clarity"])
# rename the clarity columns
df.rename(columns = {
"clarity_b\'IF\'": "clarity_IF",
"clarity_b\'VVS1\'": "clarity_VVS1",
"clarity_b\'VVS2\'": "clarity_VVS2",
"clarity_b\'VS1\'": "clarity_VS1",
"clarity_b\'VS2\'": "clarity_VS2",
"clarity_b\'SI1\'": "clarity_SI1",
"clarity_b\'SI2\'": "clarity_SI2",
"clarity_b\'I1\'": "clarity_I1"
}, inplace = True)
A diamond's cut reflects the gemstone's "final beauty and value" (GIA). The cut of a diamond determines how much light its internal structure is able to reflect, ultimately influencing its value.
# print out unique cut values
print(df['cut'].unique())
["b'Ideal'" "b'Premium'" "b'Good'" "b'Very Good'" "b'Fair'"]
The cut categories, ranked from best to worst, are: Ideal, Premium, Very Good, Good, and Fair.
# one hot encode the cut feature
if "cut" in df.columns:
df = pd.get_dummies(df, columns=["cut"])
# rename the cut columns
df.rename(columns = {
"cut_b\'Fair\'": "cut_Fair",
"cut_b\'Good\'": "cut_Good",
"cut_b\'Very Good\'": "cut_Very Good",
"cut_b\'Premium\'": "cut_Premium",
"cut_b\'Ideal\'": "cut_Ideal"
}, inplace = True)
Finally, recall that we turned our target variable into a nominal feature earlier. It would be impossible to predict the price of a diamond with exact precision, down to the last dollar, so we instead categorized the prices into ranges.
After cleaning our dataset, we have a total of 26 columns:
# print out columns
print(df.dtypes)
carat               float64
depth_percentage    float64
table               float64
length              float64
width               float64
depth               float64
price_range           int32
color_D               uint8
color_E               uint8
color_F               uint8
color_G               uint8
color_H               uint8
color_I               uint8
color_J               uint8
clarity_I1            uint8
clarity_IF            uint8
clarity_SI1           uint8
clarity_SI2           uint8
clarity_VS1           uint8
clarity_VS2           uint8
clarity_VVS1          uint8
clarity_VVS2          uint8
cut_Fair              uint8
cut_Good              uint8
cut_Ideal             uint8
cut_Premium           uint8
cut_Very Good         uint8
dtype: object
We kept carat, depth_percentage, table, length, width, and the depth column intact, excluding name changes.
For our target feature, originally price and now price_range, we changed it into a nominal variable with five categories: 0 ($300–$900), 1 ($900–$2,000), 2 ($2,000–$4,000), 3 ($4,000–$6,500), and 4 (above $6,500). We start the price ranges at $300 since the minimum price in the dataset is $326.
From the cut column, we created five more: cut_Fair, cut_Good, cut_Very Good, cut_Premium, and cut_Ideal.
From the color column, we created seven more: color_D, color_E, color_F, color_G, color_H, color_I, and color_J.
From the clarity column, we created eight more: clarity_IF, clarity_VVS1, clarity_VVS2, clarity_VS1, clarity_VS2, clarity_SI1, clarity_SI2, and clarity_I1.
Now that we've prepared the features we will need, and because the entire dataset came from a single file, we will need to split the dataset into training and test sets ourselves.
from sklearn.model_selection import ShuffleSplit
df2 = copy.deepcopy(df)
# separate the features X from the target label y
if 'price_range' in df:
# get target label
y = df['price_range'].values
# delete the price_range column from data
del df['price_range']
# standardize the data
norm_features = ['length', 'width', 'depth', 'carat', 'depth_percentage','table']
df[norm_features] = (df[norm_features]-df[norm_features].mean()) / df[norm_features].std()
# convert to numpy array
X = df.to_numpy()
# cross validation for model
num_cv_iterations = 3
num_instances = len(y)
cv_object = ShuffleSplit(n_splits = num_cv_iterations, test_size = 0.2)
print(cv_object)
ShuffleSplit(n_splits=3, random_state=None, test_size=0.2, train_size=None)
plotly.offline.init_notebook_mode()
# create graph
graph1 = {'x': np.unique(y), 'y': np.bincount(y)/len(y), 'type': 'bar'}
fig = dict()
fig['data'] = [graph1]
fig['layout'] = {'title': 'Multi Class Distribution',
'autosize':False,
'width':400,
'height':400}
# plot graph
plotly.offline.iplot(fig)
From the above graph, we observe that the split among the different price_range categories is not balanced. A plain 80/20 random split works best when the classes are balanced, so with this dataset we cannot recommend an unstratified 80/20 split; a split that preserves the class proportions would be safer.
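One common way to preserve class proportions is scikit-learn's StratifiedShuffleSplit. The sketch below uses synthetic imbalanced labels standing in for price_range (the proportions are made up for illustration) and confirms the test fold keeps roughly the same class distribution as the full set:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# synthetic imbalanced labels standing in for price_range
rng = np.random.default_rng(0)
y_demo = rng.choice([0, 1, 2, 3, 4], size=1000, p=[0.4, 0.3, 0.15, 0.1, 0.05])
X_demo = rng.normal(size=(1000, 3))

# stratified 80/20 split: each fold keeps (approximately) the class proportions
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(sss.split(X_demo, y_demo))

full_dist = np.bincount(y_demo) / len(y_demo)
test_dist = np.bincount(y_demo[test_idx]) / len(test_idx)
print(np.round(full_dist, 2))
print(np.round(test_dist, 2))
```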
Before we can implement the various custom models, we must first create a base class that will serve as the foundation for our other models.
from sklearn.preprocessing import StandardScaler
from sklearn import metrics as mt
# scale all columns (the numeric features were standardized above, so this mainly scales the one-hot columns)
X = StandardScaler().fit_transform(X)
from scipy.special import expit
# create base binary logistic regression class
class BinaryLogisticRegression:
def __init__(self, eta, iterations = 20, C = 0.001):
self.eta = eta
self.iters = iterations
self.C = C
def __str__(self):
if(hasattr(self,'w_')):
return 'Binary Logistic Regression Object with coefficients:\n'+ str(self.w_)
else:
return 'Untrained Binary Logistic Regression Object'
@staticmethod
def _add_bias(X):
return np.hstack((np.ones((X.shape[0],1)),X))
@staticmethod
def _sigmoid(theta):
return expit(theta)
# vectorized gradient calculation with regularization using L2 Norm
def _get_gradient(self,X,y):
ydiff = y-self.predict_proba(X,add_bias=False).ravel()
gradient = np.mean(X * ydiff[:,np.newaxis], axis=0)
gradient = gradient.reshape(self.w_.shape)
gradient[1:] += -2 * self.w_[1:] * self.C
return gradient
def predict_proba(self,X,add_bias=True):
# add bias term if requested
Xb = self._add_bias(X) if add_bias else X
return self._sigmoid(Xb @ self.w_)
def predict(self,X):
return (self.predict_proba(X)>0.5)
def fit(self, X, y):
# add bias term
Xb = self._add_bias(X)
num_samples, num_features = Xb.shape
self.w_ = np.zeros((num_features,1))
# for as many as the max iterations
for _ in range(self.iters):
gradient = self._get_gradient(Xb,y)
# multiply by learning rate
self.w_ += gradient*self.eta
We will vectorize our logistic regression model, since numpy array operations compute much faster than the equivalent Python for-loops.
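To illustrate the point, the sketch below (with a random toy matrix, not the diamond data) computes the same gradient both ways and confirms they agree:

```python
import numpy as np

rng = np.random.default_rng(1)
Xd = rng.normal(size=(200, 4))      # toy design matrix
ydiff = rng.normal(size=200)        # toy residuals

# loop version: accumulate each sample's contribution one at a time
grad_loop = np.zeros(4)
for i in range(len(ydiff)):
    grad_loop += Xd[i] * ydiff[i]
grad_loop /= len(ydiff)

# vectorized version, as in _get_gradient: one numpy call over all samples
grad_vec = np.mean(Xd * ydiff[:, np.newaxis], axis=0)

print(np.allclose(grad_loop, grad_vec))  # True
```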
class VectorBinaryLogisticRegression(BinaryLogisticRegression):
@staticmethod
def _sigmoid(theta):
return expit(theta)
def _get_gradient(self,X,y):
# get y difference
ydiff = y-self.predict_proba(X,add_bias=False).ravel()
# make ydiff a column vector and multiply through
gradient = np.mean(X * ydiff[:,np.newaxis], axis=0)
return gradient.reshape(self.w_.shape)
from scipy.optimize import fmin_bfgs
from numpy import ma
class BFGSBinaryLogisticRegression(BinaryLogisticRegression):
@staticmethod
def objective_function(w,X,y,C):
g = expit(X @ w)
return -np.sum(ma.log(g[y==1]))-np.sum(ma.log(1-g[y==0])) + C*sum(w**2)
@staticmethod
def objective_gradient(w,X,y,C):
g = expit(X @ w)
# get y difference
ydiff = y-g
gradient = np.mean(X * ydiff[:,np.newaxis], axis=0)
gradient = gradient.reshape(w.shape)
gradient[1:] += -2 * w[1:] * C
return -gradient
def fit(self, X, y):
# add bias term
Xb = self._add_bias(X)
num_samples, num_features = Xb.shape
self.w_ = fmin_bfgs(self.objective_function, # what to optimize
np.zeros((num_features,1)), # starting point
fprime=self.objective_gradient, # gradient function
args=(Xb,y,self.C), # extra args for gradient and objective function
gtol=1e-03, # stopping criteria for gradient, |v_k|
maxiter=self.iters, # stopping criteria iterations
disp=False)
self.w_ = self.w_.reshape((num_features,1))
class MultiClassLogisticRegression:
def __init__(self, eta, iterations=20,
C=0.0001,
solver=BFGSBinaryLogisticRegression):
self.eta = eta
self.iters = iterations
self.C = C
self.solver = solver
self.classifiers_ = []
def __str__(self):
if(hasattr(self,'w_')):
return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_)
else:
return 'Untrained MultiClass Logistic Regression Object'
def fit(self,X,y):
num_samples, num_features = X.shape
# get each unique class value
self.unique_ = np.sort(np.unique(y))
num_unique_classes = len(self.unique_)
self.classifiers_ = []
# for each unique value
for i,yval in enumerate(self.unique_):
y_binary = np.array(y==yval).astype(int)
hblr = self.solver(eta=self.eta,iterations=self.iters,C=self.C)
hblr.fit(X,y_binary)
# add the trained classifier to the list
self.classifiers_.append(hblr)
self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
def predict_proba(self,X):
probs = []
for hblr in self.classifiers_:
# get probability for each classifier
probs.append(hblr.predict_proba(X).reshape((len(X),1)))
# make into single matrix
return np.hstack(probs)
def predict(self,X):
return np.argmax(self.predict_proba(X),axis=1)
The steepest ascent method is an algorithm in which, at each iteration, the model steps in the direction of steepest increase, the gradient, until the gradient shrinks to zero or below some set tolerance. This method is useful when the maximum of a function cannot be found analytically, which makes Newton's method unreliable; steepest ascent instead provides an iterative way to obtain an approximate solution.
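A minimal sketch of the idea on a toy concave function, f(w) = -(w - 3)^2, whose maximizer is w = 3; each step applies the same update rule used in fit() above:

```python
# gradient of the toy function f(w) = -(w - 3)**2
def gradient(w):
    return -2.0 * (w - 3.0)

w, eta = 0.0, 0.1
for _ in range(100):
    w += eta * gradient(w)  # steepest ascent: step in the gradient direction

print(round(w, 4))  # 3.0, the maximiser of f
```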
class LogisticRegression:
def __init__(self, eta, iterations=20):
self.eta = eta
self.iters = iterations
def __str__(self):
if(hasattr(self,'w_')):
return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_)
else:
return 'Untrained MultiClass Logistic Regression Object'
def fit(self,X,y):
num_samples, num_features = X.shape
# get each unique class value
self.unique_ = np.unique(y)
num_unique_classes = len(self.unique_)
self.classifiers_ = []
# for each unique value
for i,yval in enumerate(self.unique_):
# create a binary problem
y_binary = (y==yval)
# train the binary classifier for this class
blr = VectorBinaryLogisticRegression(self.eta,
self.iters)
blr.fit(X,y_binary)
# add the trained classifier to the list
self.classifiers_.append(blr)
# save all the weights into one matrix, separate column for each class
self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
def predict_proba(self,X):
probs = []
for blr in self.classifiers_:
# get probability for each classifier
probs.append(blr.predict_proba(X))
# make into single matrix
return np.hstack(probs)
def predict(self,X):
return self.unique_[np.argmax(self.predict_proba(X),axis=1)]
Stochastic gradient descent is an algorithm in which a single sample (or a small batch of samples) is selected at random at each iteration, instead of using the whole dataset. This avoids the computational expense of treating the entire dataset as the batch at every step of the optimization. Gradient descent itself finds a local minimum of an objective function by following the gradient down the loss curve.
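As a toy illustration (not the diamond data), the sketch below fits a one-parameter line y = 2x with stochastic updates, drawing one random sample per step just like the _get_gradient override below:

```python
import numpy as np

rng = np.random.default_rng(42)
X1 = rng.normal(size=500)
y1 = 2.0 * X1 + rng.normal(scale=0.1, size=500)  # true slope is 2

w, eta = 0.0, 0.05
for _ in range(2000):
    idx = rng.integers(len(y1))      # grab one random instance
    err = y1[idx] - w * X1[idx]      # residual on that single instance
    w += eta * err * X1[idx]         # noisy step down the squared-error loss

print(round(w, 2))  # close to the true slope of 2
```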
class StochasticLogisticRegression(BinaryLogisticRegression):
# stochastic gradient calculation
def _get_gradient(self,X,y):
# grab random instance
idx = int(np.random.rand()*len(y))
# get scalar y difference
ydiff = y[idx]-self.predict_proba(X[idx],add_bias=False)
# make ydiff a column vector and multiply through
gradient = X[idx] * ydiff[:,np.newaxis]
gradient = gradient.reshape(self.w_.shape)
gradient[1:] += -2 * self.w_[1:] * self.C
return gradient
Newton's method, rather than finding maxima or minima directly, is an algorithm that finds the root of a function. For optimization, it is applied to the derivative of the cost function rather than the cost function itself. Compared to gradient descent, Newton's method requires the cost function to be twice differentiable, not just once.
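A minimal sketch: to minimize the toy cost f(w) = (w - 3)^4, Newton's method iterates toward the root of the derivative, using both the first and second derivatives:

```python
def f_prime(w):         # first derivative of f(w) = (w - 3)**4
    return 4.0 * (w - 3.0) ** 3

def f_double_prime(w):  # second derivative
    return 12.0 * (w - 3.0) ** 2

w = 0.0
for _ in range(50):
    w -= f_prime(w) / f_double_prime(w)  # Newton step on the derivative

print(round(w, 4))  # 3.0, the minimiser of f
```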
from numpy.linalg import pinv
class HessianBinaryLogisticRegression(BinaryLogisticRegression):
# just overwrite gradient function
def _get_gradient(self,X,y):
# get sigmoid value for all classes
g = self.predict_proba(X,add_bias=False).ravel()
# calculate the hessian
hessian = X.T @ np.diag(g*(1-g)) @ X - 2 * self.C
# get y difference
ydiff = y-g
# make ydiff a column vector and multiply through
gradient = np.sum(X * ydiff[:,np.newaxis], axis=0)
gradient = gradient.reshape(self.w_.shape)
gradient[1:] += -2 * self.w_[1:] * self.C
return pinv(hessian) @ gradient
Overfitting is when the learned hypothesis fits the training data so well that it hurts the model’s performance on any new, unseen instances. The implementation of gradient descent has no regularization, so it is likely to overfit the training data.
We use L2 regularization in order to help control overfitting of our model. This method pushes the weights toward small magnitudes without forcing them to be exactly zero. By incentivizing smaller weights, it prevents the sigmoid from developing such a large slope that it becomes nearly vertical. The regularization term we add to the loss function is the sum of squares of all the feature weights; keeping the weights small is equivalent to having smaller multipliers inside the sigmoid function.
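A small numeric sketch of that penalty (toy weights, not fitted values): the loss gains C * sum(w**2) over the non-bias weights, and the ascent gradient gains the matching -2*C*w term used by the regularized classes in this notebook:

```python
import numpy as np

w = np.array([0.0, 5.0, -3.0])  # w[0] plays the role of the bias term
C = 0.01                        # regularization strength

penalty = C * np.sum(w[1:] ** 2)        # term added to the loss
penalty_grad = np.zeros_like(w)
penalty_grad[1:] = -2.0 * C * w[1:]     # term added to the gradient (bias untouched)

print(round(penalty, 4))  # 0.34
print(penalty_grad)       # [ 0.   -0.1   0.06]
```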
class RegularizedBinaryLogisticRegression(VectorBinaryLogisticRegression):
# extend init functions
def __init__(self, C=0.0, **kwds):
# need to add to the original initializer
self.C = C
# but keep other keywords
super().__init__(**kwds) # call parent initializer
# extend previous class to change functionality
def _get_gradient(self,X,y):
# call get gradient from previous class
gradient = super()._get_gradient(X,y)
# add in regularization (to all except bias term)
gradient[1:] += -2 * self.w_[1:] * self.C
return gradient
class RegularizedLogisticRegression(LogisticRegression):
def __init__(self, C=0.0, **kwds):
# need to add to the original initializer
self.C = C
super().__init__(**kwds)
def fit(self,X,y):
num_samples, num_features = X.shape
# get each unique class value
self.unique_ = np.unique(y)
num_unique_classes = len(self.unique_)
self.classifiers_ = []
# for each unique value
for i,yval in enumerate(self.unique_):
# create a binary problem
y_binary = y==yval
blr = RegularizedBinaryLogisticRegression(eta=self.eta,
iterations=self.iters,
C=self.C)
blr.fit(X,y_binary)
# add the trained classifier to the list
self.classifiers_.append(blr)
# save all the weights into one matrix, separate column for each class
self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
%%time
# Steepest Ascent
lr = LogisticRegression(0.1,1500)
lr.fit(X,y)
yhat1 = lr.predict(X)
print(f'Accuracy of Model with Steepest Ascent: {mt.accuracy_score(y,yhat1) * 100}%')
# plot heatmap of confusion matrix
class_names = df2.price_range.unique()
fig,ax = plt.subplots(figsize=(10,10))
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(mt.confusion_matrix(y,yhat1)), annot = True, cmap = "YlGnBu" ,fmt = 'g')
plt.tight_layout()
plt.title('Confusion matrix on Correct and Incorrect Predictions of the Steepest Ascent Model', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
Accuracy of Model with Steepest Ascent: 75.27345728812924%
CPU times: total: 57.2 s
Wall time: 49.2 s
%%time
# Stochastic Gradient Descent (one-vs-rest via MultiClassLogisticRegression)
slr = MultiClassLogisticRegression(eta=0.05, iterations=2500, C=0.0001, solver = StochasticLogisticRegression)
slr.fit(X,y)
yhat2 = slr.predict(X)
print(f'Accuracy of Model with Stochastic Gradient Descent: {mt.accuracy_score(y,yhat2) * 100}%')
# plot heatmap of confusion matrix
class_names = df2.price_range.unique()
fig,ax = plt.subplots(figsize=(10,10))
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(mt.confusion_matrix(y,yhat2)), annot = True, cmap = "YlGnBu" ,fmt = 'g')
plt.tight_layout()
plt.title('Confusion matrix on Correct and Incorrect Predictions of the Stochastic Gradient Descent Model', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
Accuracy of Model with Stochastic Gradient Descent: 67.00500947484943%
CPU times: total: 438 ms
Wall time: 431 ms
%%time
# Newton's Method
hlr = HessianBinaryLogisticRegression(eta=0.1, iterations=4, C=0.01)
hlr.fit(X,y)
yhat3 = hlr.predict(X)
print(f'Accuracy of Model with Newton\'s Method: {mt.accuracy_score(y,yhat3) * 100}%')
# plot heatmap of confusion matrix
class_names = df2.price_range.unique()
fig,ax = plt.subplots(figsize=(10,10))
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(mt.confusion_matrix(y,yhat3)), annot = True, cmap = "YlGnBu" ,fmt = 'g')
plt.tight_layout()
plt.title('Confusion matrix on Correct and Incorrect Predictions of Newton\'s Method Model', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
Accuracy of Model with Newton's Method: 17.390570179553087%
CPU times: total: 7min 7s
Wall time: 26.7 s
Among our custom models, the steepest ascent model performed the best with about 75% accuracy, a significant margin over stochastic gradient descent at 67% and Newton's method at 17%. Therefore, we will move forward with our steepest ascent model.
In order to judge whether the accuracy of our custom model is good, we need a standard model to compare it against. We'll use the scikit-learn logistic regression model for this.
# import the necessary libraries
from sklearn.linear_model import LogisticRegression # note: this shadows our custom LogisticRegression class
from sklearn import metrics
logModel = LogisticRegression(random_state=16)
# fit the model
logModel.fit(X, y)
# make predictions
yhat4 = logModel.predict(X)
# print accuracy percentage of model
print("Accuracy: ", metrics.accuracy_score(y, yhat4) * 100,"%")
Accuracy: 89.84221092328187 %
c:\users\ibrahim\code\saminas_homework\machine_learning_imag_processing\.venv\image_processing\lib\site-packages\sklearn\linear_model\_logistic.py:458: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
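The warning above suggests two standard remedies: scale the features and raise `max_iter`. A minimal sketch of both, on a synthetic stand-in dataset (the data and names here are illustrative, not our diamond features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the feature matrix
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     n_informative=5, n_classes=3,
                                     random_state=16)
# StandardScaler keeps lbfgs well-conditioned; a larger max_iter gives it room to converge
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(max_iter=1000, random_state=16))
pipe.fit(X_demo, y_demo)
train_acc = pipe.score(X_demo, y_demo)
```

With scaled inputs the lbfgs solver typically converges well within the iteration budget and the warning disappears.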
# import the necessary libraries
from sklearn import metrics
import seaborn as sns
import matplotlib.pyplot as plt
# print the confusion matrix
c_matrix= metrics.confusion_matrix(y, yhat4)
class_names = df2.price_range.unique()
fig,ax = plt.subplots(figsize=(10,10))
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(c_matrix), annot = True, cmap = "YlGnBu" ,fmt = 'g')
plt.tight_layout()
plt.title('Confusion matrix on Correct and Incorrect Predictions of the Scikit Model', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
# print accuracy of steepest ascent model
print("Accuracy: ", metrics.accuracy_score(y, yhat1) * 100,"%")
# print accuracy of scikit model
print("Accuracy: ", metrics.accuracy_score(y, yhat4) * 100,"%")
Accuracy: 75.27345728812924 %
Accuracy: 89.84221092328187 %
Comparing the results of the two models, the scikit model outperforms our custom model by a significant margin: it is not only more accurate but also substantially faster.
We will use heatmaps of their confusion matrices to analyze further details.
# plot both confusion matrices side by side
class_names = df2.price_range.unique()
tick_marks = np.arange(len(class_names))
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20,10))
# heatmap for our steepest ascent model
sns.heatmap(pd.DataFrame(mt.confusion_matrix(y,yhat1)), annot = True, cmap = "YlGnBu" ,fmt = 'g', ax = ax1)
ax1.set_xticks(tick_marks, class_names)
ax1.set_yticks(tick_marks, class_names)
ax1.set_title('Confusion matrix on Correct and Incorrect Predictions of the Steepest Ascent Model', y=1.1)
ax1.set_ylabel('Actual label')
ax1.set_xlabel('Predicted label')
# heatmap for the scikit-learn model
sns.heatmap(pd.DataFrame(mt.confusion_matrix(y,yhat4)), annot = True, cmap = "YlGnBu" ,fmt = 'g', ax = ax2)
ax2.set_xticks(tick_marks, class_names)
ax2.set_yticks(tick_marks, class_names)
ax2.set_title('Confusion matrix on Correct and Incorrect Predictions of the Scikit Model', y=1.1)
ax2.set_ylabel('Actual label')
ax2.set_xlabel('Predicted label')
plt.tight_layout()
plt.show()
When it came to making correct predictions on Label 0 and Label 4 diamonds, our model outperformed the scikit model. However, the scikit model did better at correctly predicting the other classes of diamonds. Additionally, our model made more wrong predictions in the cells adjacent to the diagonal than the scikit model by a significant margin (170 vs 469, 1322 vs 549).
We recommend the scikit model over our own models. Not only is the scikit model more accurate, it is also much faster. Since our model was written from scratch, we likely missed opportunities to optimize our code that scikit-learn, a library that has been extensively developed and tuned, takes advantage of.
params = dict(eta=0.01, iterations=500)
class MSEVectorBinaryLogisticRegression(BinaryLogisticRegression):
# inherit from our previous class to get same functionality
@staticmethod
def _sigmoid(theta):
# increase stability, redefine sigmoid operation
return expit(theta) #1/(1+np.exp(-theta))
# but overwrite the gradient calculation
def _get_gradient(self,X,y):
ydiff = np.square(y-self.predict_proba(X,add_bias=False).ravel()) # squared difference between labels and predictions
gradient = np.mean(X * ydiff[:,np.newaxis], axis=0) # make ydiff a column vector and multiply through
return gradient.reshape(self.w_.shape)
# use same params as defined above
MSEblr = MSEVectorBinaryLogisticRegression(**params)
MSEblr.fit(X,y)
# allow the user to specify the algorithm used to solve the binary case
class MseLogisticRegression:
def __init__(self, eta, iterations=20,
C=0.0001,
solver=MSEVectorBinaryLogisticRegression):
self.eta = eta
self.iters = iterations
self.C = C
self.solver = solver
self.classifiers_ = []
# internally we will store the weights as self.w_ to keep with sklearn conventions
def __str__(self):
if(hasattr(self,'w_')):
return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_) # if we have trained the object
else:
return 'Untrained MultiClass Logistic Regression Object'
def fit(self,X,y):
num_samples, num_features = X.shape
self.unique_ = np.sort(np.unique(y)) # get each unique class value
num_unique_classes = len(self.unique_)
self.classifiers_ = []
for i,yval in enumerate(self.unique_): # for each unique value
y_binary = np.array(y==yval).astype(int) # create a binary problem
# train the binary classifier for this class
hblr = self.solver(eta=self.eta,iterations=self.iters,C=self.C)
hblr.fit(X,y_binary)
# add the trained classifier to the list
self.classifiers_.append(hblr)
# save all the weights into one matrix, separate column for each class
self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
def predict_proba(self,X):
probs = []
for hblr in self.classifiers_:
probs.append(hblr.predict_proba(X).reshape((len(X),1))) # get probability for each classifier
return np.hstack(probs) # make into single matrix
def predict(self,X):
return np.argmax(self.predict_proba(X),axis=1) # take argmax along row
%%time
lr = MseLogisticRegression(eta=1, iterations=500, C=0.001, solver=BFGSBinaryLogisticRegression)
lr.fit(X,y)
yhat = lr.predict(X)
print('Accuracy of: ',mt.accuracy_score(y,yhat))
Accuracy of:  0.7801459689675229
CPU times: total: 2.77 s
Wall time: 2.44 s
We tried the plain BFGS model first and did not get a promising accuracy, so we implemented option 1, techniques for logistic regression using mean squared error, and obtained better accuracy.
The BFGS algorithm is a quasi-Newton method for optimization problems. Rather than computing the second derivative (Hessian) of the objective directly, it builds an approximation to the Hessian from successive gradient evaluations and uses it to choose the direction and size of each step. This can be more efficient than methods that use only the gradient, because it takes the curvature of the objective into account.
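As an illustration of the idea (using SciPy's BFGS on a small synthetic binary problem, not the notebook's BFGSBinaryLogisticRegression class), we can minimize an L2-penalized negative log-likelihood and let BFGS approximate the curvature from gradients alone:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(float)
Xb = np.hstack([np.ones((200, 1)), X_toy])
C = 0.01  # small L2 penalty keeps the optimum finite on separable data

def neg_log_likelihood(w):
    z = Xb @ w
    # log(1 + e^z) computed stably via logaddexp
    return np.sum(np.logaddexp(0.0, z) - y_toy * z) + C * w @ w

def gradient(w):
    return Xb.T @ (expit(Xb @ w) - y_toy) + 2 * C * w

res = minimize(neg_log_likelihood, np.zeros(3), jac=gradient, method='BFGS')
acc = np.mean(((Xb @ res.x) > 0) == y_toy)
```

BFGS only ever sees the gradient; the inverse-Hessian approximation it maintains internally is what gives it Newton-like steps.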
bfgslr = BFGSBinaryLogisticRegression(_,iterations = 2, C = 0.001) # note that we need only a few iterations here
bfgslr.fit(X,y)
yhat4 = bfgslr.predict(X)
print('Accuracy of: ',mt.accuracy_score(y,yhat4))
Accuracy of: 0.24914163492748456
In the context of logistic regression, we can use BFGS to optimize the objective function. That objective is the log-likelihood, which we maximize to find the optimal weights that separate the classes. BFGS finds these weights by iteratively updating them based on the gradient and the approximated Hessian matrix until convergence.
In this case, BFGS is used to optimize the parameters of a binary logistic regression model for each price-range class (one-vs-rest). Specifically, the BFGSBinaryLogisticRegression class trains a binary classifier for each unique value of the target variable, and the MultiClassLogisticRegression class combines these binary classifiers to make multiclass predictions.
# allow the user to specify the algorithm used to solve the binary case
class MultiClassLogisticRegression:
def __init__(self, eta, iterations=20,
C=0.0001,
solver=BFGSBinaryLogisticRegression):
self.eta = eta
self.iters = iterations
self.C = C
self.solver = solver
self.classifiers_ = []
# internally we will store the weights as self.w_ to keep with sklearn conventions
def __str__(self):
if(hasattr(self,'w_')):
return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_) # if we have trained the object
else:
return 'Untrained MultiClass Logistic Regression Object'
def fit(self,X,y):
num_samples, num_features = X.shape
self.unique_ = np.sort(np.unique(y)) # get each unique class value
num_unique_classes = len(self.unique_)
self.classifiers_ = []
for i,yval in enumerate(self.unique_): # for each unique value
y_binary = np.array(y==yval).astype(int) # create a binary problem
# train the binary classifier for this class
hblr = self.solver(eta=self.eta,iterations=self.iters,C=self.C)
hblr.fit(X,y_binary)
# add the trained classifier to the list
self.classifiers_.append(hblr)
# save all the weights into one matrix, separate column for each class
self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
def predict_proba(self,X):
probs = []
for hblr in self.classifiers_:
probs.append(hblr.predict_proba(X).reshape((len(X),1))) # get probability for each classifier
return np.hstack(probs) # make into single matrix
def predict(self,X):
return np.argmax(self.predict_proba(X),axis=1) # take argmax along row
# run logistic regression and vary some parameters
from sklearn import metrics as mt
# first we create a reusable logistic regression object
# here we can setup the object with different learning parameters and constants
lr_clf = MultiClassLogisticRegression(eta=1,
iterations=500,
C=0.0001,
solver=BFGSBinaryLogisticRegression
)
# now we can use the cv_object that we setup before to iterate through the
# different training and testing sets. Each time we will reuse the logistic regression
# object, but it gets trained on different data each time we use it.
iter_num=0
# the indices are the rows used for training and testing in each iteration
for train_indices, test_indices in cv_object.split(X,y):
# I will create new variables here so that it is more obvious what
# the code is doing (you can compact this syntax and avoid duplicating memory,
# but it makes this code less readable)
X_train = X[train_indices]
y_train = y[train_indices]
X_test = X[test_indices]
y_test = y[test_indices]
# train the reusable logistic regression model on the training data
lr_clf.fit(X_train,y_train) # train object
y_hat = lr_clf.predict(X_test) # get test set predictions
# now let's get the accuracy and confusion matrix for this iterations of training/testing
acc = mt.accuracy_score(y_test,y_hat)
conf = mt.confusion_matrix(y_test,y_hat)
print("====Iteration",iter_num," ====")
print("accuracy", acc )
print("confusion matrix\n",conf)
iter_num+=1
# Also note that every time you run the above code
# it randomly creates a new training and testing set,
# so accuracy will be different each time
====Iteration 0  ====
accuracy 0.8273921200750469
confusion matrix
 [[2289  160    1    0    0]
 [ 319 1781  249    0    0]
 [   0  307 1436  300    0]
 [   1    6  236 1365  193]
 [   0    0    1   67 1949]]
====Iteration 1  ====
accuracy 0.8227954971857411
confusion matrix
 [[2285  158    1    0    0]
 [ 355 1738  270    0    0]
 [   0  308 1405  308    0]
 [   0    7  208 1401  200]
 [   0    0    4   70 1942]]
====Iteration 2  ====
accuracy 0.8314258911819887
confusion matrix
 [[2336  156    0    0    0]
 [ 305 1724  262    0    0]
 [   2  316 1447  283    0]
 [   0    7  216 1365  186]
 [   0    0    4   60 1991]]
from ipywidgets import widgets as wd
num_cv_iterations = 10
num_instances = len(y)
cv_object = ShuffleSplit(n_splits=num_cv_iterations,
test_size = 0.2)
def lr_explor(cost):
print('Running')
lr_clf = MultiClassLogisticRegression(eta=1,
iterations=500,
C=float(cost),
solver=BFGSBinaryLogisticRegression
)
acc = []
for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X,y)):
lr_clf.fit(X[train_indices],y[train_indices]) # train object
y_hat = lr_clf.predict(X[test_indices]) # get test set predictions
acc.append(mt.accuracy_score(y[test_indices],y_hat))
acc = np.array(acc)
print(acc.mean(),'+-',2.7*acc.std())
wd.interact(lr_explor,cost=list(np.logspace(-4,1,15)),__manual=True)
We do not expect data snooping to be an issue here, because the classes are evenly distributed across the splits. Our expectation is that if we collected another 20 percent of data, the chosen C and optimization would continue to work well on the new set. Data snooping is analogous to overfitting, and an even class distribution helps alleviate both conditions.
%%time
# alternatively, we can also graph out the values using boxplots
num_cv_iterations = 10
num_instances = len(y)
cv_object = ShuffleSplit(n_splits=num_cv_iterations,
test_size = 0.2)
def lr_explor(cost):
lr_clf = MultiClassLogisticRegression(eta=1,
iterations=500,
C=float(cost),
solver=BFGSBinaryLogisticRegression
)
acc = []
for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X,y)):
lr_clf.fit(X[train_indices],y[train_indices]) # train object
y_hat = lr_clf.predict(X[test_indices]) # get test set predictions
acc.append(mt.accuracy_score(y[test_indices],y_hat))
acc = np.array(acc)
return acc
costs = np.logspace(-5,1,20)
accs = []
for c in costs:
accs.append(lr_explor(c))
CPU times: total: 6min 37s Wall time: 5min 31s
# now show a boxplot of the data across c
from matplotlib import pyplot as plt
%matplotlib inline
plt.boxplot(accs)
plt.xticks(range(1,len(costs)+1),['%.4f'%(c) for c in costs],rotation='vertical')
plt.xlabel('C')
plt.ylabel('validation accuracy')
plt.show()
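From boxplots like the one above, the usual next step is to pick the C with the best mean validation accuracy. A minimal sketch with hypothetical accuracy arrays (the numbers below are made up for illustration, not our actual results):

```python
import numpy as np

# hypothetical stand-ins for the notebook's `costs` grid and per-split accuracies
costs = np.logspace(-5, 1, 4)
accs = [np.array([0.80, 0.81, 0.79]),
        np.array([0.83, 0.84, 0.82]),
        np.array([0.82, 0.82, 0.81]),
        np.array([0.70, 0.71, 0.69])]
mean_acc = np.array([a.mean() for a in accs])
best_C = costs[int(np.argmax(mean_acc))]   # C with the highest mean validation accuracy
```

With real cross-validation results, `best_C` would then be passed to the final model fit.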
%%time
lr = MultiClassLogisticRegression(eta=1,
iterations=500,
C=0.001,
solver=BFGSBinaryLogisticRegression
)
lr.fit(X,y)
print(lr)
yhat = lr.predict(X)
print('Accuracy of: ',mt.accuracy_score(y,yhat)) # use the metrics module imported as mt
MultiClass Logistic Regression Object with coefficients:
[[-7.43513578e+00 -1.30449904e+00 1.82580447e-02 -1.81746589e-01 -2.64026020e+00 -2.50768781e+00 -2.55068645e+00 -4.73933978e-01 -3.79413903e-01 -2.16055418e-01 -2.74123176e-02 3.60969134e-01 5.66155500e-01 4.38001620e-01 2.75640245e-01 -5.36650104e-01 4.08579709e-01 5.73903426e-01 -1.96701761e-01 9.22558919e-04 -5.60397354e-01 -3.95241951e-01 -1.50433423e-01 1.48452725e-01 -9.61263084e-02 -1.57340966e-01 2.36215517e-01]
 [-1.99941906e+00 -4.07153588e+00 -3.15012236e-02 7.18268896e-03 9.20573324e-01 4.40829759e-01 9.14292689e-01 9.27647179e-02 8.55000267e-02 1.89618755e-02 2.68520743e-02 -1.28301612e-01 -1.51520210e-01 4.40133747e-02 1.02730835e-01 1.10235723e-01 -4.79696972e-02 -1.10337441e-01 9.13837653e-03 3.11201732e-02 8.96042602e-02 -2.95465082e-02 9.60909389e-02 -4.31997847e-02 5.19241627e-02 4.90909926e-02 -1.21510945e-01]
 [-1.72683155e+00 -4.99047880e+00 1.14335731e-01 1.69083301e-01 1.70209867e+00 1.55883990e+00 1.55502148e+00 -2.40995100e-02 6.37551723e-03 4.98727510e-02 -6.78023038e-02 -4.22641866e-02 7.55839034e-02 2.98906533e-02 1.43113579e-01 -5.23278017e-02 -2.64694528e-02 1.59819927e-01 -2.83810422e-02 -1.03747999e-01 -4.43115014e-02 3.03153924e-02 8.00034953e-02 3.16831773e-02 6.74306579e-03 -8.25780755e-02 2.39927976e-02]
 [-2.46600858e+00 -3.70381977e+00 1.50585276e-01 1.04001245e-01 1.80472273e+00 1.52193469e+00 1.63054987e+00 3.38299308e-02 -2.41419179e-02 -2.96648767e-02 -3.72413018e-02 1.03436161e-01 -2.86532499e-02 -1.83425631e-02 -4.16141092e-02 -4.78540331e-01 3.50933205e-01 2.22325428e-01 -5.00396510e-02 2.77472900e-02 -2.13657179e-01 -2.85254650e-01 -1.92607839e-02 7.21310770e-02 -9.59292798e-02 1.71178337e-02 5.30153626e-02]
 [-6.25907170e+00 2.48069909e+00 -2.49208932e-02 -7.21848750e-02 1.73291457e+00 1.69324600e+00 1.64840286e+00 3.30656499e-01 3.34569222e-01 3.00556510e-01 1.73881675e-01 -3.02006099e-01 -5.42128242e-01 -6.88151437e-01 -8.08599163e-01 5.72484245e-01 -6.71072951e-01 -1.16572076e+00 5.82361574e-01 2.31802741e-01 6.42714046e-01 8.52669161e-01 -2.45161702e-01 -1.12645611e-01 1.63526432e-01 -5.31788752e-02 4.05805388e-02]]
from matplotlib import pyplot as plt
import copy
%matplotlib inline
plt.style.use('ggplot')
def plot_decision_boundaries(lr,Xin,y,title=''):
Xb = copy.deepcopy(Xin)
lr.fit(Xb[:,:2],y) # train only on two features
h=0.01
# create a mesh to plot in
x_min, x_max = Xb[:, 0].min() - 1, Xb[:, 0].max() + 1
y_min, y_max = Xb[:, 1].min() - 1, Xb[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# get prediction values
Z = lr.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.5)
# Plot also the training points
plt.scatter(Xb[:, 0], Xb[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Independent Value')
plt.ylabel('Dependent Value')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.title(title)
plt.show()
lr = BFGSBinaryLogisticRegression(0.1,1500) # this is still OUR LR implementation, not sklearn
plot_decision_boundaries(lr,X,y)
Code used in this notebook was adapted from Professor Larson's notebooks for this class.
We appreciate and acknowledge Professor Larson's base model.